Integrate Terminal Bench Evaluation #1154
Add a model download command and ckpt conversion here.
Done.
delegate:
  - name: terminal_bench
    # type: examples.eval.terminal_bench.tb_config.build_terminal_bench_config
    url: http://172.17.0.1:9052
Add a comment that this port should match the TB server port on the host machine.
Done.
timeout_secs: 86400 # 24 hours
max_retries: 1 # HTTP request retries from Slime to the TB server
model_name: qwen3-8b
api_base: http://127.0.1.1:30005/v1
Add a comment that this port should match the sglang router port.
Done.
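The two port comments above can be checked mechanically. The following is a minimal sketch, not code from this PR: the helper names `port_of` and `check_ports` are hypothetical, and the expected ports (9052 for the TB server, 30005 for the sglang router) are simply taken from the quoted config.

```python
from urllib.parse import urlparse


def port_of(url: str) -> int:
    """Return the explicit port of a URL, raising if none is given."""
    port = urlparse(url).port
    if port is None:
        raise ValueError(f"no explicit port in {url!r}")
    return port


def check_ports(delegate: dict, tb_server_port: int, router_port: int) -> list:
    """Collect readable mismatches instead of failing on the first one."""
    problems = []
    if port_of(delegate["url"]) != tb_server_port:
        problems.append("url port does not match the TB server port")
    if port_of(delegate["api_base"]) != router_port:
        problems.append("api_base port does not match the sglang router port")
    return problems


# Values copied from the config snippet quoted in this review.
delegate = {
    "url": "http://172.17.0.1:9052",
    "api_base": "http://127.0.1.1:30005/v1",
}
print(check_ports(delegate, tb_server_port=9052, router_port=30005))  # prints []
```

An empty list means both ports agree with what was passed in; the check says nothing about whether either service is actually listening.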
max_retries: 1 # HTTP request retries from Slime to the TB server
model_name: qwen3-8b
api_base: http://127.0.1.1:30005/v1
dataset_path: /mnt/data/xinyu/program/slime-tb/terminal-bench/tasks
Comment: this is the dataset path on the host machine.
Added this in the quick-start README.
ray start --head --node-ip-address ${MASTER_ADDR} --port 6380 --num-gpus 2 \
  --disable-usage-stats \
  --dashboard-host=0.0.0.0 \
  --dashboard-port=8266 \
Add a comment here about port conflicts.
Added this in the quick-start README.
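One way to spot the port conflict mentioned above before launching is to try binding the ports locally. This is a rough sketch under stated assumptions: `port_is_free` is a hypothetical helper, the two port numbers are the ones in the quoted `ray start` command, and the result is only a snapshot, since another process may take the port right after the check.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


# Ports taken from the `ray start` command quoted in this review.
for name, port in {"ray head": 6380, "ray dashboard": 8266}.items():
    status = "free" if port_is_free(port) else "already in use"
    print(f"{name} port {port}: {status}")
```

If a port is reported as in use, picking a different value for `--port` or `--dashboard-port` is usually the simplest way out.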
@@ -0,0 +1,12 @@
# Minimal Terminal Bench delegate config for running on the host (no containers).
Do we need to keep this?
Not used anywhere, removed.
--ulimit stack=67108864 \
--ulimit nofile=65536:65536 \
-v ~/.cache:/root/.cache \
-v $(pwd)/slime:/opt/slime \
There is an error when mounting to the /opt folder in the slime Docker container; change it to another path, such as /shared.
Switched the mount to /shared to avoid /opt issues. Thanks for pointing this out.
@classmethod
def parse(cls, args, raw_env_config: Mapping[str, Any], defaults: Mapping[str, Any]) -> TerminalBenchConfig:
Is there a better way to implement this?
Thanks for the suggestion. I refactored the implementation to reduce repetition by looping over a field-to-cast mapping. Please let me know if this looks reasonable.
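For readers without the diff at hand, the refactor described above might look roughly like the sketch below. It is an illustration, not the PR's code: the class name `TerminalBenchConfigSketch`, the `FIELD_CASTS` table, the cast choices, and the dataclass defaults are all assumptions; only the field names come from the config quoted earlier in this review. The `args` parameter of the real `parse` signature is left out for brevity.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Mapping

# Hypothetical field-to-cast table. Field names come from the quoted config;
# the cast for each field is an assumption.
FIELD_CASTS: Dict[str, Callable[[Any], Any]] = {
    "url": str,
    "timeout_secs": int,
    "max_retries": int,
    "model_name": str,
    "api_base": str,
    "dataset_path": str,
}


@dataclass
class TerminalBenchConfigSketch:
    # Default values here are placeholders, not the PR's defaults.
    url: str = ""
    timeout_secs: int = 3600
    max_retries: int = 1
    model_name: str = ""
    api_base: str = ""
    dataset_path: str = ""

    @classmethod
    def parse(
        cls, raw_env_config: Mapping[str, Any], defaults: Mapping[str, Any]
    ) -> "TerminalBenchConfigSketch":
        kwargs = {}
        for name, cast in FIELD_CASTS.items():
            # Per-delegate values take precedence over shared defaults; if the
            # result is missing or None, the dataclass default is kept.
            value = raw_env_config.get(name, defaults.get(name))
            if value is not None:
                kwargs[name] = cast(value)
        return cls(**kwargs)


cfg = TerminalBenchConfigSketch.parse(
    {"url": "http://172.17.0.1:9052", "timeout_secs": "86400"},
    {"max_retries": 1, "model_name": "qwen3-8b"},
)
print(cfg.timeout_secs, cfg.model_name)  # prints: 86400 qwen3-8b
```

Keeping the casts in one table means a new config field needs one dataclass attribute and one table entry, rather than another hand-written branch in `parse`.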
LGTM. Good job.
@zhuzilin Hi Zilin, I think this PR generally looks good and is minimally invasive. We've also tested its functionality on different machines. Do you have any other suggestions?
- Integrates **Terminal Bench** as an eval delegate for **Slime**, enabling evaluation via an external TB server.
- Adds a minimal **smoke eval config** and an example **Qwen3-8B** launch script for quick end-to-end testing.
- Provides client/server support for submitting eval jobs, polling status, and collecting metrics from Terminal Bench.

Co-authored-by: Zhiyao Jiang <jessicajiang324@gmail.com>
Co-authored-by: Xinyu Jiang <xinyuj2@andrew.cmu.edu>
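The submit, poll, and collect flow named in the commit summary can be sketched as a small polling loop. This is an assumption-laden illustration rather than the PR's client: `poll_until_done`, the `state` values, and the shape of the status dictionary are hypothetical, and the HTTP call is replaced by an injected function so the example runs without a server. The metric value is made up.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout_secs: float,
    interval_secs: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status until the job reports a terminal state or time runs out."""
    deadline = time.monotonic() + timeout_secs
    while True:
        status = fetch_status()
        if status.get("state") in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("eval job did not finish before the timeout")
        sleep(interval_secs)


# Stand-in for HTTP calls to the TB server: two "running" replies, then a
# final reply carrying an invented metric.
replies = iter([
    {"state": "running"},
    {"state": "running"},
    {"state": "completed", "metrics": {"accuracy": 0.5}},
])
result = poll_until_done(lambda: next(replies), timeout_secs=60, sleep=lambda _: None)
print(result["metrics"])  # prints: {'accuracy': 0.5}
```

Passing `sleep` and `fetch_status` as parameters keeps the loop testable; in a real client they would be `time.sleep` and a request against the server's status endpoint.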
Branch force-pushed from 1fd519d to 98facc6.
I'm not sure this is a suitable PR for slime, because it seems to be mainly an introduction to using Terminal Bench for evaluation and does not seem to show any special capability of slime. The goal of slime is not to support evaluation on all mainstream benchmarks or to recommend a particular evaluation pipeline. I'll close this for the same reason as #1025.
📝 PR Description: Integrate Terminal Bench into Slime

📝 Summary

This PR fully integrates Terminal Bench (TB) into the Slime framework, enabling end-to-end agent evaluation capabilities within the system.

examples/eval/terminal_bench.eval_delegate, ensuring metrics are correctly parsed and reported to W&B.

✅ Checklist

- `tb_server.py` implemented (Host-side).
- `tb_client.py` implemented (Container-side).
- `eval_delegate.py`.

⏳ To-do

- The current integration targets TB v1.0 via the `tb run` CLI; the workflow will be extended to support TB v2.0 based on `harbor run`.
- The TB server currently hard-codes the `terminus-2` agent; agent selection will be made configurable to support additional agents.
- The server currently uses the default `terminal-bench-core` dataset; a `-d / --dataset` argument will be added to enable evaluation on other registered datasets.
- End-to-end validation has been performed on Qwen3-8B and Qwen3-32B; evaluations will be extended to additional models.

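The dataset and agent items in the To-do list could be exposed as server options along the lines of the sketch below. Only `-d / --dataset` and the two default values come from the description; the `--agent` flag name and the `build_parser` helper are hypothetical, and nothing here reflects the actual `tb` CLI.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of TB server options")
    # Defaults mirror the values the To-do list says are currently fixed.
    parser.add_argument("-d", "--dataset", default="terminal-bench-core",
                        help="registered dataset to evaluate on")
    # Hypothetical flag name; the description only says agent selection
    # will become configurable.
    parser.add_argument("--agent", default="terminus-2",
                        help="agent used for the run")
    return parser


args = build_parser().parse_args(["--dataset", "my-dataset"])
print(args.dataset, args.agent)  # prints: my-dataset terminus-2
```

Using the current hard-coded values as defaults would keep existing launch scripts working unchanged.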
🤝 Collaborators